Deep Relevance Ranking Using Enhanced Document-Query Interactions

2018-10-11

本文提出了几种针对于文档相关性排序问题的新模型，这些模型基于已有的DRMM模型：A Deep Relevance Matching Model for Ad-hoc Retrieval) 。不同之处在于，DRMM模型使用上下文无关的term encoding编码方式，而本文提出的改进模型则借鉴自PACRR的思想，融合n-grams和不同方式编码的上下文信息。实验证明，本文提出的模型优于BM25-baseline，DRMM和PACRR。
code link

Introduction

作者首先介绍了文档相关性排序问题的定义，也即本文研究的主要对象：

Document relevance ranking, also known as ad-hoc retrieval (Harman, 2005), is the task of ranking documents from a large collection using the query and the text of each document only.

针对于 ad-hoc retrieval 问题，现有的深度模型可以分为两种（参考自
A Deep Relevance Matching Model for Ad-hoc Retrieval）：

representation-focused model ：构建一个深度神经网络来获得文本的向量表示，进而做匹配，例如DSSM , C-DSSM and ARC-I
interaction-focused model ：首先在两段文本上构建local interactions (i.e., local matching signals)，然后使用深度神经网络去学习分级匹配模式（hierarchical interaction patterns for matching）

One is the representation-focused model, which tries to build a good representation for a single text with a deep neural network, and then conducts matching between the compositional and abstract text representations. Examples include DSSM [12], C-DSSM [23, 8] and ARC-I [11]. The other is the interaction-focused model, which first builds local interactions (i.e., local matching signals) between two pieces of text, and then uses deep neural networks to learn hierarchical interaction patterns for matching. Examples include DeepMatch [17], ARC-II [11] and MatchPyramid [19].

本文主要研究的是第二种模型中的Deep Relevance Matching Model (DRMM) ，下图是DRMM的结构图。

针对于query和document中的每个词，DRMM 使用预训练的词向量。首先对query中的每个词与doc所有词计算相似度（论文中使用了cos距离），此处没有使用位置信息，而是将一个query对应的所有相似度进行分级（即文中说的直方图：document-aware q-term encoding）。

直方图是指：例如将余弦相似度[-1, 1]分为五个区间{[-1,-0.5), [-0.5,-0), [0,0.5), [0.5,1), [1,1]} 。给定query中的一个词“car”以及一篇文档（car, rent, truck, bump, injunction, runway）, 得到对应的局部交互空间为(1, 0.2, 0.7, 0.3, -0.1, 0.1)，最后我们用基于计数的直方图方法得到的直方图为[0,1, 3, 1, 1]

原始的Deep Relevance Matching Model (DRMM) 论文中，直方图的生成方式有三种：

Count-based Histogram (CH): This is the simplest way of transformation as described above which directly takes the count of local interactions in each bin as the histogram value.
Normalized Histogram (NH): We normalize the count value in each bin by the total count to focus on the relative rather than the absolute number of different levels of interactions.
LogCount-based Histogram (LCH): We apply logarithm over the count value in each bin, both to reduce the range, and to allow our model to more easily learn multiplicative relationships. (实际的实验证明LCH效果最好)

得到直方图后，将其组合成一个向量输入到一个全连接网络中，得到一个query词针对于一个doc的相关性分数。然后使用Term Gating Network得到权重分布（不同的词重要程度不同），在原始论文中，作者直接使用了softmax计算权重：

其中，$w_{g}$代表weight vector，与$x_{i}^{(q)}$维度相同，是一个训练参数；$x_{i}^{(q)}$是query中的每个词的embedding。原始论文尝试了两种方案：

Term Vector (TV): Inspired by the work where term embeddings can be leveraged to learn the term weights in queries, we use query term vectors as the input of the gating function. In this method, $x_{i}^{(q)}$ denotes the i-th query term vector, and $w_{g}$ is a weight vector with the same dimensionality of term vectors.
Inverse Document Frequency (IDF): An important signal of term importance in ad-hoc retrieval is the inverse document frequency. We also tried this simple but powerful signal in the gating function. In this method, $x_{i}^{(q)}$ denotes the inverse document frequency of the i-th query term, and $w_{g}$ reduces to a single parameter.

得到权重分布后，对每个query词针对于一个doc的相关性分数计算加权和得到最终的相关性得分：
$$s=\sum_{i=1}^{M}g_{i}z_{i}^{L}$$

简要的说明一下训练过程：

Since the ad-hoc retrieval task is fundamentally a ranking problem, we employ a pairwise ranking loss such as hinge loss to train our deep relevance matching model. Given a triple $(q,d^{+},d^{-})$, where document $d^{+}$ is ranked higher than document $d^{-}$ with respect to query q, the loss function is $$\zeta(q,d^{+},d^{-};\Theta)=max(0,1-s(q,d^{+})+s(q,d^{-}))$$
where $s(q, d)$ denotes the predicted matching score for $(q, d)$ , and $Θ$ includes the parameters for the feed forward matching network and those for the term gating network. The optimization is relatively straightforward with standard back-propagation. We apply stochastic gradient descent method Adagrad with mini-batches (20 in size), which can be easily parallelized on single machine with multi-cores. For regularization, we find that the early stopping strategy works well for our model.

DRMM模型的缺点在于完全忽略了每个词的上下文信息，而一些新的position-aware模型例如 PACRR (Hui et al., 2017) 和基于循环神经网络的模型(Palangi et al., 2016)则考虑到这点。

本文的主要贡献在于给DRMM模型增加了上下文信息。因为原始DRMM模型中直方图的构建方式并不是可微的，所以DRMM不支持端到端的训练方式。本文总共提出了以下几种基于DRMM模型的拓展方法：

PACRR-like convolutional n-gram matching features (x3.1)
context-sensitive term encodings (x3.2)
query-focused attention-based document representations (x3.3)
pooling to reward denser term matches and turn rich term representations into fixed-width vectors (x3.4)
multiple views of terms, e.g., context sensitive, insensitive, exact matches (x3.5)

作者在BIOASQ生物医药QA问答集 (Tsatsaronis et al., 2015) and TREC ROBUST 2004 (Voorhees, 2005)做测试，结果表明改进后的模型比效果优于BM25-based baselines (Robertson and Zaragoza, 2009), DRMM, and PACRR.

_很值得一看，相当于总结了深度学习在检索中的各种应用，是一个很好的概述。_

Previous Model

DRMM

Introduction中已经介绍了DRMM的整体结构，这里对一些细节再做详细解释。

对于Figure 2中的Term Gating Network，本文作者认为是一种线性self-attention：

其中，$\phi _{g}(q_{i})$是第i个query词语的词向量$e(q_{i})$或者IDF $idf(q_{i})$，$w_{g}$是一个weight vector（训练参数）。目的是基于每个词的重要程度来赋予不同的权重。作者发现$\phi _{g}(q_{i})=[e(q_{i});idf(q_{i})]$对于所有DRMM-based模型是最优的。

原始DRMM模型中最核心的是相似度直方图，原始论文中选取的是固定的相似度分数范围，导致固定维度的全连接网络输入，直方图的计算方式会使DRMM模型不受不同文档和query长度的影响，这是DRMM模型的核心优点。缺点是不能考虑词序和上下文。

PACRR

PACRR模型则提出了一种新的计算相似度分数的方法，以下为其结构：

首先根据每一个query term embedding和doc term embedding得到余弦相似度得分矩阵，矩阵的维度为$|q| \times |d|$，
矩阵元素(i, j)代表第i个query词语和第j个document词语的余弦相似度分数。为了保证对于不同长度的query和doc矩阵维度相同，论文中采取了以下方式：
- 对于不同长度的query : 直接选取所有queries的最大长度，然后zero-padding =>矩阵的行固定为$l_{d}$；
- 对于不同长度的doc：本文使用PACRR-firstk模型，即选择一个固定的k值，只取相似度矩阵的前k列；如果$k>|d|$，采取zero-padding。
上述方法能够得到一个固定维度大小的相似度矩阵，然后分别使用卷积大小为 $n \times n(n=2,…,l_{g})$的卷积核去捕获n-grams的query-document similarities，对于每一类 $n \times n$ 的卷积核，使用$n_{f}$个卷积核进行二维卷积。为了保证不同的的卷积核大小输出的结果维度一致，设置卷积步长为1，调整padding使得输出的矩阵大小相同（与原始的相似度矩阵相同）。最终卷积后的输出为：对于每一类 $n \times n$ 的卷积核，对应的输出结果为 $n_{f}$ 个 $l_{q} \times l_{d}$ 的矩阵；总共有 $l_{g}-1$ 类卷积核，所以共有 $(l_{g}-1) \times n_{f}$ 个输出矩阵，矩阵的维度都是 $l_{q} \times l_{d}$ 。
之后经过两个pooling层。首先是对the dimension of the filters做max-pooling，也即对每一类 $n \times n$ 大小的卷积核，取$n_{f}$个激活图的最大值（channel dimension），max-pooling后的输出是 $l_{g}-1$ 个矩阵，矩阵维度为 $l_{q} \times l_{d}$ ；最后将原始的相似度矩阵直接拿过来堆叠在一起，所以最终输出的是 $l_{g}$ 个矩阵，矩阵维度为 $l_{q} \times l_{d}$ 。之后是对 $l_{g}$ 个矩阵的每行（query dimension）进行k-max pooling，目的是获得每个query term与所有的doc terms最强的前k个关联信息特征，得到的输出为 $l_{g}$ 个矩阵，矩阵维度为 $l_{q} \times k$ 。
然后将上一步中的 $l_{g}$ 个矩阵按照query dimension拼接在一起，得到一个矩阵，维度为 $l_{q} \times (l_{g}k)$，再加入softmax归一化后的query IDF列向量，最终得到 $l_{q} \times (l_{g}k+1)$ 的矩阵。
- 在原始的PACRR模型论文中，得到 $l_{q} \times (l_{g}k+1)$ 的矩阵之后，使用一个LSTM神经网络，以 $l_{q} \times (l_{g}k+1)$ 矩阵的每一行作为LSTM的one step input，LSTM最终输出query与doc的相关性得分 $rel(q, d)$。
- 而本文采取了一种新的方式（来自于Co-PACRR: A context-aware neural IR model for ad-hoc retrieval）：直接将$l_{q} \times (l_{g}k+1)$ 的矩阵拼接成一个向量，然后再输入到全连接网络中，最终得到query与doc的相关性得分 $rel(q, d)$。

与DRMM模型相比，PACRR模型最大的优点在于通过n-grams的卷积对上下文建模，尽管获取的上下文信息不够直接（作者认为直接对terms embedding进行卷积比对相似度得分矩阵能够得到更强的上下文信息），而且PACRR模型是端到端可训练的。

New Relevance Ranking Models

PACRR-DRMM

Fig 3.B是PACRR-DRMM模型的结构图，与PACRR模型不同的是，第四步并没有将all the (document-aware) q-term encodings矩阵拼接成一个向量，而是使用相同的MLP网络独立地计算每一个q-term encoding（矩阵的每一行）的分数，再通过一个线性层得到query与doc的相关性得分 $rel(q, d)$ 。

与DRMM相比，PACRR-DRMM没有采用其gating mechanism方式，而是直接将IDFs的值拼接到term encoding，再通过MLP计算q-term的分数，这相当于是另一种term-gating的方式（作者认为这是一种shortcut passing on information about the terms, here their IDFs, to upper layers）。

实验证明，PACRR-DRMM效果优于PACRR，作者认为可能是因为前者的MLP网络具有更少的参数（MLP层数相同的情况下），更不容易过拟合。

Context-sensitive Term Encodings

DRMM和PACRR模型都使用的是预训练的词向量作为term encoding，这种方式没有考虑到query term的上下文。

信息检索领域，context-sensitivity有两种方式：

term proximity : 侧重于上下文，the context around a term match is also relevant
term dependency：侧重于n-grams，aims to capture when multiple terms (e.g., an n-gram) must be matched together

在接下来介绍的模型中，都是通过编码器对q-term和d-term进行上下文编码，使用的是BILSTM网络，然后在每个位置上拼接前向和后向的LSTM的隐层状态。同时，将原始的term embedding与隐层状态结合，相当于残差结构，如下所示：

其中，$\underset{h(t_{i})}{\rightarrow}$ 和 $\underset{h(t_{i})}{\leftarrow}$ 为顶层LSTM的隐层状态，$e(t_{i})$ 是term的词向量，$c(t_{i})$ 是第i个term的encoding。作者也尝试过将这些向量输入到MLP中得到一个encoding，但实验效果并没有改善。

与PACRR模型相比，最大的区别在于Context-sensitive Term Encodings直接对term encoding进行上下文建模，计算的相似度分数已经包含上下文信息，而PACRR模型对相关性分数矩阵（直接由词向量得到）做n-grams卷积得到上下文信息。

ABEL-DRMM（Attention-Based ELement-wise DRMM）

ABEL-DRMM是在DRMM模型的基础上将原始DRMM模型中的直方图计算变为Attention，使得从term输入到最后计算query-document相关性得分整个过程可微，可以进行端到端的训练。结构如下：

首先计算每个q-term相对于所有d-terms的注意力分数（内积dot-product attention）：

其中：$c(t)$是t的上下文编码（Eq. 1）
然后对d-terms encoding进行加权求和：
接着使用 L2正则化后的 $d_{q_{i}}$ 与L2正则化的q-term encoding进行Hadamard运算(element-wise multiplication)，得到document-aware q-term encoding，代替DRMM模型中的直方图：

Other element-wise functions can also be used to compare $d_{q_{i}}$ to $c(q_{i})$, instead of the Hadamard product (Eq. 4). For example, a vector containing the squared terms of the Euclidean distance between $d_{q_{i}}$ and $c(q_{i})$ could be used instead of $\phi _{H}(q_{i})$. This change had no effect on ABEL-DRMM’s performance on development data. We also tried using $[d_{q_{i}};c(q_{i})]$ instead of $\phi _{H}(q_{i})$, but performance on development data deteriorated.

和DRMM模型一样，ABEL-DRMM模型与文档长度无关，但是可以端到端训练。但是两种模型都不能 reward higher density matches.

Still, both models do not reward higher density matches.

POSIT-DRMM (Pooled Similarity DRMM)

与ABEL-DRMM不同，POSIT-DRMM使用余弦相似度计算每个q-term相对于所有d-terms的注意力分数：
然而在ABEL-DRMM模型中并没有使用加权求和，而是直接将所有的$a_{i,j}$直接拼接在一起：
之后通过两个pooling层，max-pooling返回的是 $q_{i}$ 与doc最相关匹配的信息，而avg-pooling返回的是 $q_{i}$ 与doc前top-k最相关匹配的平均信息：

其中 $\phi _{P}(q_{i})$ 即为document-aware q-term encoding，代替DRMM模型中的直方图。

POSIT-DRMM模型比其他模型的参数更少，因为全连接网络的输入维度只是一个长度为2的向量，而ABEL-DRMM与term embedding的长度相同。因此，POSIT-DRMM模型的全连接网络只需要一层，更多训练超参数见 Deep Relevance Ranking Using Enhanced Document-Query Interactions Supplementary Material 。

POSIT-DRMM is closely related to PACRR (and PACRR-DRMM). Like POSIT-DRMM, PACRR first computes cosine similarities between all q-terms and d-terms (Fig. 3). It then applies n-gram convolutions to the similarity matrix to inject context-awareness, and then pooling to create document-aware q-term representations. Instead, POSIT-DRMM relies on the fact that the term encodings are now already context sensitive (Eq. 1) and thus skips the n-gram convolutions. Again, this is a choice of when context is injected – during term encoding or after computing similarity scores.

Multiple Views of Terms (+MV)

基于ABEL-DRMM和POSIT-DRMM模型的一种改进方案是结合多种term encoding，例如term embedding, term context-sensitive encoding, context-insensitive encoding等信息。具体实例见Fig. 6。

_这种思想其实可以用到所有的深度模型中，值得学习_

Experiments

作者在做实验时，为了避免待匹配的文档数过多，先使用了一个传统的基于BM25算法的IR系统（Galago）选取前topN个文档，之后再用深度模型对这前topN个文档进行重排序。

作者在BioASQ和TREC Robust 2004两个数据集上做实验，结果如下：

本文所使用的评价指标（参考自https://blog.csdn.net/u010138758/article/details/69936041）：

Precision：准确率(精确度)是指检索得到的文档中相关文档所占的比例，公式如下：

precision@20表示检索20个文档中包含相关所占的比例，即：|{retrieved documents}| = 20。
Mean average precision(MAP)：通常会用多个查询语句来衡量检索系统的性能，所以应该对多个查询语句的AveP求均值，即：
Discounted cumulative gain(DCG)：在MAP计算公式中，文档只有相关不相关两种，而在nDCG中，文档的相关度可以分多个等级进行打分。

从实验结果可以看出：

本文中提出的几种改进模型均比原始的DRMM，PACRR更优；
基于context-sensitive encoding的ABEL-DRMM和POSIT-DRMM优于DRMM-PACRR模型；
MV方法能够改善模型效果。

Helic He

NLP

Deep Relevance Ranking Using Enhanced Document-Query Interactions

Introduction

Previous Model